In marketing, promotions and product offerings that are poorly targeted hurt the company: they waste time and energy, and can even cause financial losses when money is spent running the campaign. A sound strategy is therefore needed so that marketing programs run well and reach the right audience.
It is clear, then, that marketing strategy is crucial, including for banking companies. Customer segmentation is one of the most widely used strategies, because promotions that are right on target are believed to increase profits. This strategy is considered highly effective and efficient in terms of money, time, and effort.
In this case, customers are segmented based on their credit card usage. We will use clustering, an unsupervised machine learning technique, to create the segmentation.
Clustering in this case uses the PyCaret library. PyCaret is a machine learning library in Python that takes programmers smoothly from data preparation to model deployment; it can be used to perform end-to-end machine learning, including imputing missing values, encoding categorical data, feature engineering, hyperparameter tuning, and building models. PyCaret also makes it fast to apply various machine learning methods, including clustering, so instead of spending time on coding we can focus more on the business problem itself.
For more details about the PyCaret library, refer to the official documentation: https://pycaret.org/
The data used is CC GENERAL.csv, downloaded from Kaggle to a local drive. The following code reads the data using pandas and previews the dataset.
import pandas as pd
data = pd.read_csv("CC GENERAL.csv")
data.head()
Description of the dataset:

- CUST_ID : Identification of credit card holder (categorical)
- BALANCE : Balance amount left in the account to make purchases
- BALANCE_FREQUENCY : How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES : Amount of purchases made from the account
- ONEOFF_PURCHASES : Maximum purchase amount done in one go
- INSTALLMENTS_PURCHASES : Amount of purchases done in installments
- CASH_ADVANCE : Cash in advance given by the user
- PURCHASES_FREQUENCY : How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFF_PURCHASES_FREQUENCY : How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
- PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- CASH_ADVANCE_FREQUENCY : How frequently the cash in advance is being paid
- CASH_ADVANCE_TRX : Number of transactions made with "cash in advance"
- PURCHASES_TRX : Number of purchase transactions made
- CREDIT_LIMIT : Credit card limit for the user
- PAYMENTS : Amount of payments done by the user
- MINIMUM_PAYMENTS : Minimum amount of payments made by the user
- PRC_FULL_PAYMENT : Percent of full payment paid by the user
- TENURE : Tenure of credit card service for the user

To inspect the column types and row counts, run:

data.info()
Based on this inspection, we know that the CC GENERAL.csv data contains 8950 credit card customers. The ultimate goal of this case is to group each of these customers.
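Before handing the data to PyCaret, it is worth confirming the row count and checking which columns have missing values (PyCaret's setup() can impute them, but knowing where they are helps). A minimal sketch of the idea, using a tiny made-up stand-in frame since the real CSV is not reproduced here:

```python
import pandas as pd
import numpy as np

# Tiny stand-in for CC GENERAL.csv (hypothetical values, only a few columns)
data = pd.DataFrame({
    "CUST_ID": ["C10001", "C10002", "C10003"],
    "BALANCE": [40.90, 3202.47, np.nan],      # one missing value on purpose
    "CREDIT_LIMIT": [1000.0, 7000.0, 7500.0],
})

n_customers = len(data)           # for the real file this would be 8950
missing = data.isnull().sum()     # per-column count of missing values
print(n_customers)
print(missing)
```

On the real data, the same two lines report the 8950 rows and reveal which columns (such as MINIMUM_PAYMENTS) carry missing values.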
After reading and checking the data, the next step is to initialize the PyCaret environment using the setup() function.
This step generates a pipeline that prepares the data for the model to be built. In this case, the following parameters are set:

- normalize = True : because the features have very different scale ranges, scaling is considered necessary
- ignore_features = ['CUST_ID'] : this column only stores a unique identifier per customer, which is unnecessary for clustering
- session_id = 123 : fixes the random seed so that the clustering results are always the same; it is set to 123 for later reproducibility

from pycaret.clustering import *
s = setup(data, normalize = True, ignore_features = ['CUST_ID'], session_id = 123)
The create_model() function allows us to easily create and evaluate clustering models. By default it creates 4 clusters; if we already know the number of clusters in our data, we can set it with the num_clusters parameter. In this case we don't know how many clusters there are, so we will use the default value; however, we will pass num_clusters = 4 explicitly, just for demonstration purposes.
PyCaret provides many clustering algorithms; to see the list, use the models() function.
models()
After running the function, a number of performance metrics are generated, including the Silhouette, Calinski-Harabasz, and Davies-Bouldin scores. We will focus on the Silhouette Coefficient.
kmeans = create_model('kmeans', num_clusters = 4)
The mean Silhouette Coefficient ranges between -1 and 1. A negative value indicates that an instance has been assigned to the wrong cluster, a value near 0 indicates that clusters overlap, and a positive value close to 1 indicates a correct assignment. Here the Silhouette Coefficient is 0.2081, which is considered reasonably good given how many customers are in our data.
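To make the metric concrete, here is an illustrative computation of the mean Silhouette Coefficient on a toy 1-D dataset with two well-separated clusters. This is a pure-Python sketch of the standard definition (not PyCaret's implementation); each point's score compares its mean intra-cluster distance a with its mean distance b to the nearest other cluster:

```python
# Mean Silhouette Coefficient for 1-D points with given cluster labels.
# Assumes every cluster has at least two points (no division-by-zero guard).
def mean_silhouette(points, labels):
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [q for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        # a: mean distance to points in the same cluster
        a = sum(abs(p - q) for q in same) / len(same)
        # b: mean distance to the nearest other cluster
        b = min(sum(abs(p - q) for q, l in zip(points, labels) if l == ol)
                / labels.count(ol)
                for ol in set(labels) - {lab})
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
labels = [0, 0, 0, 1, 1, 1]
score = mean_silhouette(points, labels)
print(score)  # close to 1: tight, well-separated clusters
```

With overlapping clusters the same function would return a value near 0, and mislabelled points would pull it negative, which is why 0.2081 reads as "some structure, with overlap".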
The evaluate_model() function analyzes the model performance of the clustering results. It displays several types of plots at once, namely Cluster 2D, Cluster 3D, Elbow, Silhouette, Distance, and Distribution. To do so, run the following code.
evaluate_model(kmeans)
In this project we will show only a few of these plots, using the plot_model() function, which creates individual graphs or visualizations of the model's performance. The plots shown are the 2D cluster plot, the elbow plot, and the distribution plot.
plot_model(kmeans, plot = 'cluster')
The 2-dimensional cluster PCA plot shows that K-Means separates the clusters well. On visual inspection, the four clusters, distinguished by color, are separated in a satisfactory way, although some points still overlap.
plot_model(kmeans, plot = 'elbow')
The elbow method suggests the optimal number of clusters. In this case, the elbow plot suggests that 5 is the optimal number. Next we will try to create a model using k = 5.
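The idea behind the elbow plot can be sketched by hand: fit k-means for increasing k, record the within-cluster sum of squared distances (distortion), and look for the point where the curve stops dropping sharply. Below is a minimal hand-rolled 1-D version (Lloyd's algorithm with a deterministic quantile-style initialisation), not PyCaret's implementation, on toy data with three obvious groups:

```python
# Within-cluster sum of squares for k-means on 1-D data (Lloyd's algorithm).
def kmeans_inertia(points, k, iters=20):
    pts = sorted(points)
    n = len(pts)
    # spread the k initial centers evenly through the sorted data
    centers = [pts[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        # assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # recompute each center as its cluster's mean (keep old if empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    # distortion: sum of squared distances to the nearest center
    return sum(min((p - c) ** 2 for c in centers) for p in pts)

points = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 10.0, 10.2, 10.4]
inertias = [kmeans_inertia(points, k) for k in range(1, 5)]
print(inertias)  # big drops up to k = 3, then nearly flat: the elbow is at 3
```

The distortion keeps shrinking as k grows, but past the true number of groups the gain per extra cluster becomes marginal; that bend is what the PyCaret elbow plot reads off for us.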
create_model('kmeans', num_clusters = 5)
Because its Silhouette Coefficient is smaller than with the default parameter, we decided to stick with 4 clusters.
The distribution plot gives a graphical presentation of how many customers fall into each cluster.
plot_model(kmeans, plot = 'distribution')
The distribution plot shows the size of each cluster. Cluster 0 contains the most samples, almost 4000.
The assign_model() function assigns a cluster label to each row of the data. A new column named Cluster is created, indicating which cluster each customer is grouped into.
result = assign_model(kmeans)
result.head()
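The counts behind the distribution plot can also be read off numerically from the labelled frame. A sketch with a small made-up stand-in for the assign_model() output (the "Cluster 0" label format follows PyCaret's convention; the values are invented):

```python
import pandas as pd

# Stand-in for the assign_model() result: made-up balances plus cluster labels
result = pd.DataFrame({
    "BALANCE": [40.9, 3202.5, 2495.1, 817.7, 1809.8, 627.3],
    "Cluster": ["Cluster 0", "Cluster 1", "Cluster 0",
                "Cluster 0", "Cluster 2", "Cluster 1"],
})

# Same per-cluster counts that the distribution plot visualizes
sizes = result["Cluster"].value_counts()
print(sizes)
```

On the real result, the same one-liner confirms the roughly 4000 customers in Cluster 0.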
PyCaret also has a predict_model() function, which predicts clusters for new data.
Before making predictions, let's build dummy data representing 2 new credit card customers.
# initialise data of lists.
new_customer = {'CUST_ID':['C19191', 'C19192'], 'BALANCE':[1564.474828,503.432897], 'BALANCE_FREQUENCY':[1.000000,0.777776],
'PURCHASES':[1093.25,103.33], 'ONEOFF_PURCHASES':[1093.25,0.00], 'INSTALLMENTS_PURCHASES':[0.0,300.00],
'CASH_ADVANCE':[6442.945483,0.000000], 'PURCHASES_FREQUENCY':[0.083333,1.000000],
'ONEOFF_PURCHASES_FREQUENCY':[0.083333,1.000000],'PURCHASES_INSTALLMENTS_FREQUENCY':[0.083333,1.000000],
'CASH_ADVANCE_FREQUENCY':[0.350000,0.000000], 'CASH_ADVANCE_TRX':[6,2], 'PURCHASES_TRX':[12,1],
'CREDIT_LIMIT':[10000.0,2000.0], 'PAYMENTS':[5103.032597,401.802084], 'MINIMUM_PAYMENTS':[2072.340217,107.340217],
'PRC_FULL_PAYMENT':[0.222222,0.000000], 'TENURE':[12,12]}
# Create DataFrame
new_data = pd.DataFrame(new_customer)
predictions = predict_model(kmeans, data = new_data )
predictions.head()
The prediction was carried out successfully: the customer with ID C19191 was assigned to cluster 2, while the customer with ID C19192 was assigned to cluster 1.
If we want to embed the model in an application, we can save the K-Means model built earlier using the save_model() function.
save_model(kmeans, 'kmeans_pipeline')
Loading the saved model is just as simple, using the load_model() function. If the model is considered satisfactory enough, we can then embed it in the application we want to deploy.
loaded_model = load_model('kmeans_pipeline')
print(loaded_model)
The last part of this project is interpreting the characteristics of each cluster, by passing specific features to the distribution plot via the plot_model() function.
plot_model(kmeans, plot = 'distribution', feature='PURCHASES_INSTALLMENTS_FREQUENCY')
Clusters 0 and 2 contain customers who rarely make purchases in installments, while customers in clusters 1 and 3 tend to make installment purchases frequently.
plot_model(kmeans, plot = 'distribution', feature='CREDIT_LIMIT')
Cluster 3 has the largest credit card usage limit with a median value of $9000, followed by cluster 2 with a median of $7000. Meanwhile, clusters 0 and 1 have a smaller usage limit with a median value of $2500 and $3000, respectively.
plot_model(kmeans, plot = 'distribution', feature='PAYMENTS')
Customers in cluster 3 tend to make large payments compared to other clusters.
plot_model(kmeans, plot = 'distribution', feature='BALANCE')
Another interesting point is found in cluster 2, where the balance left in the account is relatively large. Clusters 2 and 3 both tend to have large balances, but usage in cluster 2 is relatively smaller than in cluster 3.
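The per-feature distribution plots above can be condensed into a single numeric profile by grouping the labelled data on the Cluster column and taking medians. A sketch on a small made-up stand-in for the assign_model() result (the figures are invented, chosen only to echo the pattern described above):

```python
import pandas as pd

# Made-up labelled data standing in for the real assign_model() output
result = pd.DataFrame({
    "CREDIT_LIMIT": [2500.0, 3000.0, 7000.0, 9000.0, 2400.0, 9500.0],
    "BALANCE":      [1000.0, 1200.0, 5000.0, 4000.0,  900.0, 4200.0],
    "Cluster":      ["Cluster 0", "Cluster 1", "Cluster 2",
                     "Cluster 3", "Cluster 0", "Cluster 3"],
})

# Median of each feature per cluster: a compact cluster "profile" table
profile = result.groupby("Cluster")[["CREDIT_LIMIT", "BALANCE"]].median()
print(profile)
```

Run on the real result frame, this one table summarises what the separate distribution plots show feature by feature.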
The PyCaret library's plot_model() function is able to reveal the characteristics of each cluster. The marketing team can use this to promote and offer products better suited to each customer's circumstances, so that the chance of customers accepting an offer or promotion becomes greater.